Optionally disable logging in the data sampler to support predict_step #10127

jstjohn · 2024-08-13T17:36:38Z

PyTorch Lightning's predict_step does not support logging on the module. This PR adds an option to the data sampler to disable logging, which allows the predict step to work.
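A minimal sketch of the idea, not NeMo's actual implementation. `SketchDataSampler`, `on_step_end`, and `PredictOnlyModule` are hypothetical names invented for illustration; only the `output_log` flag name comes from this PR's commit messages ("Move output_log to last arg for compatibility"). The sketch assumes the sampler reports a consumed-samples count through a logging callback, and that the callback fails when invoked during prediction.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SketchDataSampler:
    """Hypothetical stand-in for a data sampler that reports consumed samples."""

    global_batch_size: int
    # Kept as the last field so existing positional callers are unaffected.
    output_log: bool = True

    def on_step_end(self, step: int, log: Callable[[str, int], None]) -> int:
        """Compute consumed samples and log them only when output_log is set."""
        consumed_samples = (step + 1) * self.global_batch_size
        if self.output_log:
            log("consumed_samples", consumed_samples)
        return consumed_samples


class PredictOnlyModule:
    """Hypothetical module whose log() is unavailable, as assumed for predict."""

    def log(self, name: str, value: int) -> None:
        raise RuntimeError(f"cannot log {name!r} during predict")


module = PredictOnlyModule()

# With logging disabled, the sampler never touches module.log, so the step succeeds.
quiet_sampler = SketchDataSampler(global_batch_size=8, output_log=False)
consumed = quiet_sampler.on_step_end(step=2, log=module.log)
```

Leaving the flag at its default keeps the existing training behavior; a caller would only pass `output_log=False` when running prediction.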

tests/collections/llm/test_mnist_model_nemo2.py

+    raise ValueError("non-None value not found")
+
+
+def get_dtype_device(torch_object) -> Tuple[torch.dtype, torch.device]:  # noqa: D103


tests/collections/llm/test_mnist_model_nemo2.py

+
+
+# NOTE(SKH): These types are all wrong, but are close. The inner type must always be a torch.Tensor, but the outer container should be generic.
+def batch_collator(batches: Optional[Union[Tuple[ReductionT], List[ReductionT]]]) -> Optional[ReductionT]:


tests/collections/llm/test_mnist_model_nemo2.py

+        case [list(), *_]:
+            return [batch_collator([batch[i] for batch in batches]) for i in range(len(batches[0]))]
+        case None:
+            return None


ko3n1g · 2024-08-14T19:06:52Z

stopping this for now until our CI is out of maintenance

nemo/lightning/pytorch/plugins/data_sampler.py

ashors1 · 2024-08-21T05:42:45Z

Looks good, thanks! Can we clean up some of the old comments in the test script?


ashors1

LGTM! Thanks!

NVIDIA#10127)

* Resolve merge conflicts with consumed sample logging
* Add test file that captures the predict step error
* Add fixme comment around proper checkpoint nemo2 handling
* Skip megatron training test on CPU nodes
* Move output_log to last arg for compatibility
* try setting the default root dir in predict to avoid writing artifacts to cwd
* Handle the new check for batch samplers to enable predict_step
* Only reset the global microbatch, not entire parallel state
* Destroy the right sets of state in test of lightning trainer
* Fix typo and rename state resetting functions
* Run test in a subprocess to avoid contaminating global state

Signed-off-by: John St John <jstjohn@nvidia.com>


jstjohn requested review from marcromeyn and ashors1 August 13, 2024 17:36

jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from baf1c1f to c583c91 Compare August 13, 2024 17:54

github-advanced-security bot found potential problems Aug 13, 2024


jstjohn added the Run CICD label Aug 13, 2024

jstjohn self-assigned this Aug 13, 2024

jstjohn force-pushed the jstjohn/optional_logging_for_predict branch 2 times, most recently from f278078 to 479e2d8 Compare August 14, 2024 19:01

jstjohn added Run CICD and removed Run CICD labels Aug 14, 2024

ko3n1g removed the Run CICD label Aug 14, 2024

ko3n1g added the Run CICD label Aug 15, 2024

jstjohn added Run CICD and removed Run CICD labels Aug 15, 2024

jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from ac2075b to b62a392 Compare August 15, 2024 21:04

jstjohn added Run CICD and removed Run CICD labels Aug 15, 2024

jstjohn commented Aug 15, 2024


nemo/lightning/pytorch/plugins/data_sampler.py (outdated, resolved)

jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from eddcbb4 to 6f4e2d9 Compare August 15, 2024 21:21

jstjohn added Run CICD and removed Run CICD labels Aug 15, 2024

jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from 6f4e2d9 to 5766f47 Compare August 16, 2024 16:35

jstjohn added Run CICD and removed Run CICD labels Aug 16, 2024

jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from e2672ee to 88d9df9 Compare August 16, 2024 19:00

jstjohn added Run CICD and removed Run CICD labels Aug 16, 2024

jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from 88d9df9 to 92d6307 Compare August 16, 2024 21:18

jstjohn removed the Run CICD label Aug 16, 2024

jstjohn added Run CICD and removed Run CICD labels Aug 20, 2024

jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from d0258da to 0da36a9 Compare August 20, 2024 21:11

jstjohn added Run CICD and removed Run CICD labels Aug 20, 2024

jstjohn added 11 commits August 21, 2024 19:35

Resolve merge conflicts with consumed sample logging

9fa0364

Signed-off-by: John St John <jstjohn@nvidia.com>

Add test file that captures the predict step error

6374c2e

Signed-off-by: John St John <jstjohn@nvidia.com>

Add fixme comment around proper checkpoint nemo2 handling

c6c93ac

Signed-off-by: John St John <jstjohn@nvidia.com>

Skip megatron training test on CPU nodes

7058723

Signed-off-by: John St John <jstjohn@nvidia.com>

Move output_log to last arg for compatibility

e391c72

Signed-off-by: John St John <jstjohn@nvidia.com>

try setting the default root dir in predict to avoid writing artifact…

9997393

…s to cwd Signed-off-by: John St John <jstjohn@nvidia.com>

Handle the new check for batch samplers to enable predict_step

3720193

Signed-off-by: John St John <jstjohn@nvidia.com>

Only reset the global microbatch, not entire parallel state

70fe6fa

Signed-off-by: John St John <jstjohn@nvidia.com>

Destroy the right sets of state in test of lightning trainer

8c1ea86

Signed-off-by: John St John <jstjohn@nvidia.com>

Fix typo and rename state resetting functions

dfdf426

Signed-off-by: John St John <jstjohn@nvidia.com>

Run test in a subprocess to avoid contaminating global state

a6ff157

Signed-off-by: John St John <jstjohn@nvidia.com>

jstjohn force-pushed the jstjohn/optional_logging_for_predict branch from 236f257 to a6ff157 Compare August 21, 2024 19:35

jstjohn added Run CICD and removed Run CICD labels Aug 21, 2024

ashors1 approved these changes Aug 21, 2024


jstjohn merged commit ff7c614 into main Aug 21, 2024
129 of 130 checks passed

jstjohn deleted the jstjohn/optional_logging_for_predict branch August 21, 2024 23:38
